DATA 202 - Week 14

Multivariate regression

Nathan Alexander, PhD

Center for Applied Data Science and Analytics

Part I: Context

Now that we have identified and practiced many foundational elements - such as generating code to explore data, cleaning data for analysis, and some elements of theory construction - we can begin focusing on one of the most important technical components of model building and analysis: interpretation.

  • Interpretation relies very heavily on both your research question and the subsequent empirical study.

  • While your research question may be based on a host of factors, your empirical study relies on a combination of:

    • Theoretical frameworks

    • Analytic method

    • Interpretations

This framing follows a suggested mode of the triangulation method from Tzagkarakis & Kritas (2023).

Research questions

The research questions below highlight the intersection of social justice issues and multiple-variable quantitative analysis. Keep in mind that these questions can be further refined and tailored to specific contexts or issues of interest within the realm of social justice.

  1. How do income inequality and geographical location affect access to quality education?

  2. What disparities exist in the criminal justice system by race and gender?

  3. How do gender discrimination and age impact career advancement in the workplace?

  4. What are the effects of housing policies and income on residential segregation and access to affordable housing?

  5. How do healthcare accessibility and affordability vary across different socioeconomic groups?

Sample analysis

Let us continue with a sample analysis.

We will assume that a state collected data for a sample of 100 randomly selected cities requesting funding after the approval of a new bill on affordable housing. The data set includes three key variables, along with a city identifier.

Research question

What is the relationship between state funding for affordable housing initiatives and the availability of new affordable housing units?

Details about each variable are provided below:

  • city is a marker (which matches the data index) used to indicate a randomly selected city.

  • funding is the total amount of funding provided to families (in thousands of dollars) in a given 3-week period.

  • housing_availability is the average of city housing units allocated over the same funding period.

  • advocacy is the average number of calls to the state representatives’ hotline four months prior.

The advocacy variable was generated as a result of a similar study conducted in a neighboring state, which noticed a potential lagged relationship between advocacy and funding allocations approved at the state level.

head(data)
  city funding housing_availability advocacy
1    1  251.32                34.60    21.15
2    2  422.71                50.32    35.27
3    3  418.33                45.47    28.54
4    4  423.08                46.43    27.23
5    5  503.13                55.97    36.67
6    6  428.78                60.77    27.49
tail(data)
    city funding housing_availability advocacy
95    95  248.48                55.39    18.27
96    96  385.40                58.23    33.48
97    97  314.17                67.69    26.19
98    98  222.00                52.38    22.95
99    99  317.35                57.37    18.10
100  100  463.09                59.65    27.02
summary(data)
      city           funding      housing_availability    advocacy    
 Min.   :  1.00   Min.   :216.2   Min.   :34.60        Min.   :16.19  
 1st Qu.: 25.75   1st Qu.:280.4   1st Qu.:48.54        1st Qu.:22.52  
 Median : 50.50   Median :343.4   Median :54.97        Median :27.04  
 Mean   : 50.50   Mean   :360.4   Mean   :54.54        Mean   :26.63  
 3rd Qu.: 75.25   3rd Qu.:437.7   3rd Qu.:59.55        3rd Qu.:30.30  
 Max.   :100.00   Max.   :547.4   Max.   :77.86        Max.   :37.34  

Exploration

We can use some base-R commands to get a quick summary of each variable.

# get plots of variables (columns are referenced directly, assuming attach(data))
hist(funding)

hist(housing_availability)

# get summary statistics for variables
summary(funding)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  216.2   280.4   343.4   360.4   437.7   547.4 
summary(housing_availability)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.60   48.54   54.97   54.54   59.55   77.86 

We can also produce quick plots to examine the relationship between each pair of variables.

Here, we include code to get the correlation coefficient.

# perform correlation analysis
plot(funding, housing_availability)

cor(funding, housing_availability)
[1] 0.266359
plot(advocacy, funding)

cor(advocacy, funding)
[1] 0.4757307
plot(advocacy, housing_availability)

cor(advocacy, housing_availability)
[1] 0.1444811

Interpretation

First, researchers decided to run a linear regression model on housing_availability and funding.

# perform linear regression analysis
model1 <- lm(housing_availability ~ funding)

# summary of the regression model
summary(model1)

Call:
lm(formula = housing_availability ~ funding)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.1793  -5.9060  -0.6551   5.0543  22.4049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 45.49080    3.41826  13.308  < 2e-16 ***
funding      0.02511    0.00918   2.736  0.00739 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.579 on 98 degrees of freedom
Multiple R-squared:  0.07095,   Adjusted R-squared:  0.06147 
F-statistic: 7.484 on 1 and 98 DF,  p-value: 0.007391
# plot the data and regression line (requires the ggplot2 package)
library(ggplot2)
ggplot(data, aes(x = funding, y = housing_availability)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "City Funding", y = "Housing Availability", title = "Relationship between City Funding and Housing Availability")

One researcher, however, suggested that a robust regression analysis should be used in place of the OLS techniques. Robust regression analysis, as you may recall, helps us reduce outlier effects.

Note: we need to install and load the MASS package (library(MASS)) to run some of the following code.

ols <- lm(housing_availability ~ funding)
summary(ols)

Call:
lm(formula = housing_availability ~ funding)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.1793  -5.9060  -0.6551   5.0543  22.4049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 45.49080    3.41826  13.308  < 2e-16 ***
funding      0.02511    0.00918   2.736  0.00739 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.579 on 98 degrees of freedom
Multiple R-squared:  0.07095,   Adjusted R-squared:  0.06147 
F-statistic: 7.484 on 1 and 98 DF,  p-value: 0.007391
opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar) # we use the par() function to restore graphical parameters to their original values

From this analysis, we see that a few observations may be problematic for our model.

We can explore some of these observations in more detail.

data[c(12, 50, 73), 1:4]
   city funding housing_availability advocacy
12   12  396.66                77.86    30.15
50   50  470.96                75.79    22.38
73   73  217.93                68.29    27.94

The three cities noted (and there may be others) have large residuals.

We can examine these in more detail.

distance <- cooks.distance(ols) # we get a measure of the Cook's distance values.
res <- stdres(ols)
a <- cbind(data, distance, res)
a[distance > 4/100, ]
   city funding housing_availability advocacy   distance       res
1     1  251.32                34.60    21.15 0.04984715 -2.029432
7     7  216.20                64.53    17.56 0.04561739  1.614402
12   12  396.66                77.86    30.15 0.04014294  2.626744
16   16  495.17                73.25    25.20 0.05225347  1.813880
50   50  470.96                75.79    22.38 0.05837888  2.179622
73   73  217.93                68.29    27.94 0.07252929  2.053558
75   75  243.32                67.02    20.18 0.04371737  1.820393
78   78  236.61                36.76    16.89 0.04260811 -1.734093

These decisions were based on the following notes:

  • Cook’s distance cooks.distance() provides a measure of the influence of a data point when performing regression.

  • stdres() standardizes the residuals from our model.

  • cbind() attaches the two measures to our data frame.

We used the commonly recommended cutoff of \(4/n\), where \(n\) is the sample size, to select the values to display.

We then get the absolute value of the residuals (remember that the sign does not matter in distance), and we print the observations with the highest residuals (here we focus on the top 10 values).

absres <- abs(res)
data1 <- cbind(data, distance, res, absres)
assorted <- data1[order(-absres), ]
assorted[1:10,]
   city funding housing_availability advocacy   distance       res   absres
12   12  396.66                77.86    30.15 0.04014294  2.626744 2.626744
50   50  470.96                75.79    22.38 0.05837888  2.179622 2.179622
88   88  317.76                35.29    20.73 0.02780093 -2.131960 2.131960
25   25  286.74                70.57    22.02 0.03637332  2.100555 2.100555
73   73  217.93                68.29    27.94 0.07252929  2.053558 2.053558
1     1  251.32                34.60    21.15 0.04984715 -2.029432 2.029432
85   85  279.11                36.86    19.93 0.03027410 -1.839822 1.839822
75   75  243.32                67.02    20.18 0.04371737  1.820393 1.820393
16   16  495.17                73.25    25.20 0.05225347  1.813880 1.813880
78   78  236.61                36.76    16.89 0.04260811 -1.734093 1.734093

We now run our robust regression analysis.

We do this by using the rlm() function in the MASS package.

There are several weight functions that can be used with the iteratively reweighted least squares (IRLS) technique.

rrmodel <- rlm(housing_availability ~ funding, data = data)
summary(rrmodel)

Call: rlm(formula = housing_availability ~ funding, data = data)
Residuals:
     Min       1Q   Median       3Q      Max 
-17.9198  -5.4416  -0.3424   5.2609  22.8048 

Coefficients:
            Value   Std. Error t value
(Intercept) 45.7779  3.5798    12.7880
funding      0.0234  0.0096     2.4328

Residual standard error: 8.213 on 98 degrees of freedom

The default weight is the Huber weight.

Huber weights are a type of weight function used to downweight or mitigate the influence of outliers on the estimation procedure.

In traditional least squares regression, all data points are given equal weight, and the estimation procedure is sensitive to the presence of outliers. The use of weights in our robust regression model aims to provide more robust estimates by assigning different weights to the observations, giving less influence to outliers.
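As a rough sketch of how the default weighting rule works, the Huber rule can be written as a small function. (The tuning constant k = 1.345 matches rlm()'s default; in practice rlm() applies the rule to residuals scaled by a robust estimate of spread, which this sketch omits.)

```r
# Huber weight rule (sketch): full weight inside the cutoff k,
# downweighted in proportion to the residual's size outside it
huber_w <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))

huber_w(0.5)  # small residual: full weight of 1
huber_w(5)    # large residual: weight 1.345 / 5 = 0.269
```

Note how the weight never drops to zero: outliers are downweighted, not discarded.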

hweights <- data.frame(city = data$city, resid = rrmodel$resid, weight = rrmodel$w)
hweights2 <- hweights[order(rrmodel$w),]
hweights2[1:15,]
   city     resid    weight
12   12  22.80484 0.4843946
50   50  18.99708 0.5814916
25   25  18.08570 0.6107904
88   88 -17.91981 0.6164041
73   73  17.41506 0.6343102
1     1 -17.05588 0.6476276
16   16  15.89084 0.6951638
75   75  15.55123 0.7103366
85   85 -15.44584 0.7151312
43   43  14.97271 0.7377868
97   97  14.56416 0.7584838
78   78 -14.55183 0.7590662
9     9  14.36036 0.7692537
7     7  13.69553 0.8065877
33   33 -12.74093 0.8669457

Huber weights assign larger weights to observations that are close to the regression line and smaller weights to observations that deviate significantly from the line. The weight assigned to each observation depends on its residuals (the difference between the observed values and the predicted values).

Causality

Despite our work on the initial model, the issue of causality needs to be discussed.

There are a few considerations that need to be taken into account:

  • Confounding variables: There may be other factors that influence housing availability apart from city funding. For example, economic conditions and social policies can also play significant roles. Failing to account for these confounding variables may lead to erroneous conclusions about the causal relationship.

  • Reverse causality: The relationships can be bidirectional. Higher housing availability rates may lead to increased city funding directed at addressing the issue. Thus, it’s possible that the relationship is driven by reverse causality, where higher levels of housing availability cause increased funding rather than the other way around.

  • Omitted variable bias: There may be unobserved or unmeasured factors that affect both city funding and housing availability. Failing to include these variables in the analysis can lead to omitted variable bias, potentially distorting the estimated relationships.

  • Ecological fallacy: Analyzing data aggregated across the state and city levels may not capture the nuances within the relationship. Aggregating data can lead to an ecological fallacy, where conclusions made at the aggregate level may not hold true at other levels.

Multicollinearity

Multicollinearity refers to a high correlation or linear relationship between two or more predictor variables in a regression model. In the case of three variables, multicollinearity occurs when there is a strong linear relationship between any pair of the three variables, making it difficult to separate their individual effects on the response variable. This can cause instability in the regression model, inflated standard errors, and difficulties in interpreting the coefficients.
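A common diagnostic for multicollinearity is the variance inflation factor (VIF). As a sketch of the idea, we can compute it by hand: regress one predictor on the others and apply VIF = 1 / (1 - R²). The example below uses the built-in mtcars data, since our course data set is not bundled with R.

```r
# VIF for hp when wt is the other predictor: how much of hp's
# variance is already explained by wt?
r2_hp <- summary(lm(hp ~ wt, data = mtcars))$r.squared
vif_hp <- 1 / (1 - r2_hp)
vif_hp  # values above roughly 5-10 are a common warning sign
```

Packages such as car provide a vif() function that does this for every predictor at once.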

Assume we updated our theoretical statement and research question and added the advocacy variable to our model.

# perform linear regression analysis
model2 <- lm(housing_availability ~ funding + advocacy)

# summary of the regression model
summary(model2)

Call:
lm(formula = housing_availability ~ funding + advocacy)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.9890  -6.1250  -0.6158   4.9763  22.3024 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.80516    4.77827   9.377 2.97e-15 ***
funding      0.02408    0.01049   2.296   0.0238 *  
advocacy     0.03969    0.19229   0.206   0.8369    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.621 on 97 degrees of freedom
Multiple R-squared:  0.07136,   Adjusted R-squared:  0.05221 
F-statistic: 3.727 on 2 and 97 DF,  p-value: 0.02759

Interaction effects

Next, we add an interaction term to our model.

# get a summary of the advocacy data
summary(advocacy)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.19   22.52   27.04   26.63   30.30   37.34 
# examine the relationship between funding and advocacy
cor(advocacy, funding)
[1] 0.4757307
# perform linear regression analysis
model3 <- lm(housing_availability ~ funding + advocacy + funding*advocacy)

# summary of the regression model
summary(model3)

Call:
lm(formula = housing_availability ~ funding + advocacy + funding * 
    advocacy)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.9963  -6.2218  -0.5457   4.8889  22.3465 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)   
(Intercept)      49.0944885 17.5511591   2.797  0.00623 **
funding           0.0117777  0.0495659   0.238  0.81268   
advocacy         -0.1236422  0.6712607  -0.184  0.85425   
funding:advocacy  0.0004576  0.0018009   0.254  0.79997   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.663 on 96 degrees of freedom
Multiple R-squared:  0.07198,   Adjusted R-squared:  0.04298 
F-statistic: 2.482 on 3 and 96 DF,  p-value: 0.06555

Please note that we may need to run additional tests or more robust models to inform interpretation.

Statistical vs. practical significance

When analyzing the relationship between state funding and housing availability, it is important to consider both statistical significance and practical significance.

Statistical significance refers to the likelihood that the observed relationship or difference between variables is not due to chance. It is determined through statistical tests, such as hypothesis testing or p-values. In this context, statistical significance would indicate whether there is evidence to suggest that state funding has a statistically significant effect on housing availability. A statistically significant result suggests that the relationship between the variables is unlikely to have occurred by random chance.

Practical significance focuses on the magnitude or practical importance of the observed relationship. It asks whether the observed effect size is meaningful or substantial in real-world terms. In the case of state funding and housing availability, practical significance would involve evaluating whether the observed impact of state funding on housing availability is large enough to have a meaningful or substantial effect on the availability of housing units.

Note, however, that while statistical significance provides evidence of a relationship, it does not necessarily imply practical importance. A statistically significant relationship may exist but have a negligible or trivial effect in practice. Conversely, a relationship may have practical significance, even if it does not reach statistical significance due to limited sample size or other factors.
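A quick back-of-the-envelope calculation makes the distinction concrete for model1. The funding slope (0.02511 from the summary above, with funding measured in thousands of dollars) is statistically significant, but whether the implied change matters in practice is a separate judgment:

```r
# funding coefficient from model1; funding is in thousands of dollars
slope <- 0.02511
extra_units <- slope * 100  # implied effect of a $100,000 increase
extra_units                 # about 2.5 additional housing units
```

Whether 2.5 units per $100,000 is "substantial" depends on context - city size, unit costs, and policy goals - not on the p-value.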

Replication studies

Exploring varied statistical outputs and their significance in a social justice context requires care, both in terms of the underlying theories that relate to the variables themselves and their use across different contexts. An additional factor that we have discussed relates to the role of theoretical constructions and their applicability to issues of social injustice.

More often than not, caution should take the lead when developing new models. In these instances, some variation on what is known as a replication study can become a valuable tool. A replication study is a type of study that aims to reproduce or replicate the findings of a previous study. In the context of our course, the replication frameworks can be applied to examine the relationships between variables across contexts and different populations.

There are different types of replication studies.

  • Direct replication: In this replication study type, researchers attempt to reproduce the original study as closely as possible, meaning they follow the same research design, methodologies, and data analysis procedures.

  • Partial replication: In this replication study type, researchers attempt to replicate only a portion of the original study. Often, researchers doing a partial replication study focus on a specific aspect, variable, or component of the study.

  • Conceptual replication: In this replication study type, researchers conduct a replication analysis that focuses on the same research question(s) but through the use of different methods, measures, or population groups.

While replication studies are often used to help assess the credibility and generalizability of statistical research findings, they can also serve as part of a broader process to examine the role of context in statistical models. Importantly, a failure to replicate the findings of a study does not mean that the original study findings were incorrect or flawed. Together, these types of explorations can contribute to scientific knowledge and provide evidence to help us understand the role of theory and the practice of social justice.

Beyond regression

Researchers have access to a wide range of advanced statistical techniques and methodologies that provide deeper insights into complex relationships and patterns within data. These approaches go beyond the linear relationships examined in regression analysis and allow researchers to explore non-linear, interactive, and dynamic effects among variables. By utilizing these advanced techniques, researchers can uncover hidden patterns, make more accurate predictions, account for complex interactions, and gain a more comprehensive understanding of the phenomena under investigation.

Some of these methods also provide greater flexibility in handling missing data, dealing with outliers, and accommodating various types of data structures. Overall, these advanced statistical techniques expand the set of tools researchers can use to delve deeper into the complexities of their data and extract meaningful insights.

Part II: Content

Multiple Variable Analysis and Multivariate Analysis are two terms often used in statistics and research methodology to describe different approaches to analyzing data involving multiple variables. While they share similarities, there are distinct differences between these two concepts.

Multivariable vs. Multivariate

Multiple variable analysis investigates the influence of individual independent variables on a single dependent variable, while multivariate analysis explores the relationships and patterns among multiple variables simultaneously.

Multiple Variable Analysis is often used when studying the effects of specific factors, while multivariate analysis is employed to uncover broader patterns and structures within a dataset. Both approaches are valuable in data analysis, and the choice between them depends on the research objectives and the nature of the data being analyzed.

Definitions: Multiple variable analysis vs. Multivariate analysis

Multiple Variable Analysis: Multiple Variable Analysis refers to the process of examining the relationships between several independent variables and a single dependent variable. It aims to understand how each independent variable influences or predicts the dependent variable individually, while controlling for other variables. In this analysis, each independent variable is analyzed separately, often using techniques such as regression analysis or analysis of variance (ANOVA).

Multivariate Analysis: Multivariate Analysis involves the simultaneous analysis of multiple dependent and independent variables. It aims to explore the relationships and patterns among multiple variables, considering them as a whole. This analysis technique allows for the examination of complex interactions and associations between variables, providing a more comprehensive understanding of the data.
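The distinction shows up directly in R's model syntax. As a minimal sketch using the built-in mtcars data, a multiple regression has one response variable, while a multivariate regression bundles several responses together with cbind():

```r
# one dependent variable: multiple (multivariable) regression
m_multiple <- lm(mpg ~ hp + wt, data = mtcars)

# two dependent variables fit at once: multivariate regression
m_multivariate <- lm(cbind(mpg, qsec) ~ hp + wt, data = mtcars)
class(m_multivariate)  # "mlm" "lm"
```

R marks the second fit with the class "mlm", and functions like coef() then return a matrix with one column per response.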

Key characteristics of multiple variable analysis

  1. Focus: Examining the impact of individual independent variables on a single dependent variable.

  2. Analytic approach: Each independent variable is analyzed separately, allowing for isolation of their effects.

  3. Purpose: To identify the individual contributions and significance of multiple variables in explaining the variation in the dependent variable.

  4. Statistical techniques: Common techniques include simple linear regression, multiple linear regression, and ANOVA.

Key characteristics of multivariate analysis

  1. Focus: Examining the relationships and interactions among multiple variables simultaneously.

  2. Analytic approach: Considering all variables together, accounting for their joint effects and potential interdependence.

  3. Purpose: To explore patterns, associations, and structures within the data, identifying underlying factors or dimensions.

  4. Statistical techniques: Common techniques include factor analysis, principal component analysis, cluster analysis, and structural equation modeling.

Examples of multivariate analysis techniques

  • Principal component analysis (PCA): PCA is used to reduce the dimensionality of data by transforming it into a new set of uncorrelated variables called principal components. R functions for PCA include prcomp() and princomp().

  • Factor analysis: Factor Analysis aims to identify latent factors that explain the correlations among observed variables. R offers functions like factanal() and psych::fa() for conducting factor analysis.

  • Canonical correlation analysis (CCA): CCA examines the relationships between two sets of variables and identifies the linear combinations of each set that have maximum correlation with each other. The cancor() function in the stats package can be used for this analysis.

  • Cluster analysis: Cluster Analysis groups similar observations into clusters based on the similarity of their characteristics. R provides various clustering techniques, such as k-means clustering (kmeans()), hierarchical clustering (hclust()), and model-based clustering (Mclust() in the mclust package).

  • Discriminant analysis: Discriminant Analysis aims to find a linear combination of variables that maximally separate predefined groups or classes. R offers functions like lda() and qda() for performing Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), respectively.

  • Multivariate regression: Multivariate Regression extends linear regression to multiple response variables. The lm() function in R, with a matrix of responses built by cbind(), can be used for multivariate regression analysis.

  • Multivariate analysis of variance (MANOVA): MANOVA extends the analysis of variance (ANOVA) to multiple response variables simultaneously. The manova() function in R can be used for MANOVA.

  • Multidimensional scaling (MDS): MDS visualizes the similarity or dissimilarity between objects in a lower-dimensional space. R provides functions like cmdscale() in the stats package and isoMDS() in the MASS package for performing MDS.

  • Structural Equation Modeling (SEM): SEM is a comprehensive framework for testing complex relationships among variables. R packages like lavaan and sem offer functionalities for conducting SEM.

  • Correspondence Analysis: Correspondence Analysis explores the associations between categorical variables and visualizes them in a low-dimensional space. The ca() function in the ca package is commonly used for correspondence analysis.
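As a small illustration of the first technique in this list, PCA can be run in a few lines on the built-in mtcars data:

```r
# scale the variables so each contributes equally, then extract
# the principal components
pca <- prcomp(mtcars, scale. = TRUE)
summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:2])  # observation scores on the first two components
```

With 11 variables, PCA returns 11 components, but in practice the first few typically capture most of the variance.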

We will consider a few of these models in our final weeks for the course.

Part III: Code

This week, we use some standard data included in R to further discuss model interpretation.

While these data sets do not directly connect to the content of our course, they provide useful examples to return to, as they are widely discussed on R websites and in online forums.

Each example illustrates different scenarios for interpreting linear models using the summary output. Remember to consider coefficients, standard errors, t-values, and p-values to assess the significance and direction of relationships between predictors and the response variable. Additionally, theory construction and relevant knowledge and context are crucial for a comprehensive interpretation of the results.

These data are from the 1974 Motor Trend US magazine. The data set comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). You could run similar models using data in the critstats package.

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Example 1: Simple Linear Regression

# Fit a simple linear regression model
model <- lm(mpg ~ hp, data = mtcars)

# Print the model summary
summary(model)

Call:
lm(formula = mpg ~ hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

The summary output provides information about the coefficients, standard errors, t-values, and p-values. In this case, the intercept represents the estimated baseline miles per gallon (mpg) when horsepower is zero. The coefficient for horsepower indicates the estimated change in mpg for each unit increase in horsepower.

Example 2: Multiple Linear Regression

# Fit a multiple linear regression model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Print the model summary
summary(model)

Call:
lm(formula = mpg ~ hp + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

The summary output provides interpretation for each coefficient. For example, the coefficient for horsepower represents the estimated change in mpg for each unit increase in horsepower, holding weight constant. Similarly, the coefficient for weight represents the estimated change in mpg for each unit increase in weight, holding horsepower constant.

Example 3: Categorical Predictor

# Fit a linear regression model with a categorical predictor
model <- lm(mpg ~ factor(cyl), data = mtcars)

# Print the model summary
summary(model)

Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   26.6636     0.9718  27.437  < 2e-16 ***
factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

When a categorical predictor, such as “cyl” (number of cylinders), is included in the model, R automatically treats it as a set of dummy variables. The summary output provides the coefficients for each category level (e.g., 4 cylinders, 6 cylinders, 8 cylinders). These coefficients represent the estimated difference in the response variable (mpg) compared to the reference category (usually the intercept).
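If a different baseline makes the comparison easier to read, the reference category can be changed with relevel(). A sketch, reusing the same model:

```r
# make 8 cylinders the reference category instead of 4
cyl_ref8 <- relevel(factor(mtcars$cyl), ref = "8")
model_ref8 <- lm(mpg ~ cyl_ref8, data = mtcars)
coef(model_ref8)  # intercept is now the mean mpg of 8-cylinder cars
```

The other coefficients then give the estimated mpg difference of 4- and 6-cylinder cars relative to the 8-cylinder group.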

Example 4: Interaction Effect

# Fit a linear regression model with an interaction term
model <- lm(mpg ~ hp * wt, data = mtcars)

# Print the model summary
summary(model)

Call:
lm(formula = mpg ~ hp * wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp:wt        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,    Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

When an interaction term (e.g., horsepower * weight) is included in the model, the summary output provides coefficients for both main effects (horsepower and weight) as well as the interaction term. The interaction coefficient represents the change in the relationship between mpg and horsepower as weight increases.
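One way to read the interaction is to write the hp slope as a function of weight, using the coefficients reported above: slope(hp) = -0.12010 + 0.02785 × wt. A sketch:

```r
# hp slope implied by the interaction model at a given car weight
slope_hp_at <- function(wt) -0.12010 + 0.02785 * wt

slope_hp_at(2)               # light car: each horsepower costs more mpg
slope_hp_at(mean(mtcars$wt)) # average car: roughly -0.03 mpg per horsepower
```

In other words, the mpg penalty for added horsepower shrinks as weight increases, which is what the positive hp:wt coefficient encodes.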

Sample code discussed in class

# do some exploratory analysis on the survey data in the MASS package
library(MASS)
library(dplyr)
survey <- as_tibble(survey)

# check the structure of the data
str(survey)
pairs(survey)

# subset the data
survey %>% select(Wr.Hnd, NW.Hnd, Pulse, Height, Age) -> df1
df1
pairs(df1)

# build our model with one indicator
mlm1 <- lm(cbind(Height, Pulse) ~ Age, data = df1)
summary(mlm1)

# build our model with more than one indicator
mlm2 <- lm(cbind(Height, Pulse) ~ Age + Wr.Hnd + NW.Hnd, data = df1)
summary(mlm2)

head(resid(mlm1))  # residuals
head(fitted(mlm1)) # fitted values from the model

head(resid(mlm2))  # residuals
head(fitted(mlm2)) # fitted values from the model

# gather coefficients
coef(mlm2)

# variance-covariance matrix
vcov(mlm2)

Next up: Week 15